Abstract. We consider the problem of optimality, in a minimax sense,
and adaptivity to the margin and to regularity in binary classification.
We prove an oracle inequality, under the margin assumption (low noise
condition), satisfied by an aggregation procedure which uses exponential
weights. This oracle inequality has an optimal residual: $(\log M/n)^{\kappa/(2\kappa-1)}$,
where $\kappa$ is the margin parameter, $M$ the number of classifiers to aggregate
and $n$ the number of observations. We use this inequality first to
construct minimax classifiers under margin and regularity assumptions
and second to aggregate them to obtain a classifier which is adaptive both
to the margin and regularity. Moreover, by aggregating plug-in classifiers
(only $\log n$), we provide an easily implementable classifier adaptive both
to the margin and to regularity.



1 Introduction 

Let $(\mathcal{X}, \mathcal{A})$ be a measurable space. We consider a random variable $(X, Y)$ with
values in $\mathcal{X} \times \{-1, 1\}$ and denote by $\pi$ the distribution of $(X, Y)$. We denote by
$P^X$ the marginal of $\pi$ on $\mathcal{X}$ and $\eta(x) = \mathbb{P}(Y = 1 | X = x)$ the conditional
probability function of $Y = 1$ given that $X = x$. We denote by $D_n = (X_i, Y_i)_{i=1,\ldots,n}$,
$n$ i.i.d. observations of the couple $(X, Y)$.

We recall some usual notions introduced for the classification framework. A
prediction rule is a measurable function $f : \mathcal{X} \to \{-1, 1\}$. The misclassification
error associated to $f$ is

$$R(f) = \mathbb{P}(Y \neq f(X)).$$

It is well known (see, e.g., [21]) that $\min_f R(f) = R(f^*) \stackrel{\rm def}{=} R^*$, where the
prediction rule $f^*$ is called Bayes rule and is defined by

$$f^*(x) = \mathrm{sign}(2\eta(x) - 1).$$

The minimal risk $R^*$ is called the Bayes risk. A classifier is a function, $\hat f_n =
\hat f_n(X, D_n)$, measurable with respect to $D_n$ and $X$ with values in $\{-1, 1\}$, that
assigns to the sample $D_n$ a prediction rule $\hat f_n(\cdot, D_n) : \mathcal{X} \to \{-1, 1\}$. A key
characteristic of $\hat f_n$ is the value of generalization error $\mathbb{E}[R(\hat f_n)]$. Here

$$R(\hat f_n) = \mathbb{P}(Y \neq \hat f_n(X) \,|\, D_n).$$

The performance of a classifier $\hat f_n$ is measured by the value $\mathbb{E}[R(\hat f_n) - R^*]$ called
the excess risk of $\hat f_n$. We say that the classifier $\hat f_n$ learns with the convergence
rate $\phi(n)$, where $(\phi(n))_{n \in \mathbb{N}}$ is a decreasing sequence, if there exists an absolute
constant $C > 0$ such that for any integer $n$, $\mathbb{E}[R(\hat f_n) - R^*] \le C\phi(n)$. Theorem
7.2 of [21] shows that no classifier can learn with a given convergence rate for
arbitrary underlying probability distribution $\pi$.

In this paper we focus on entropy assumptions which allow us to work with
finite sieves. Hence, we first work with a finite model for $f^*$: it means that we
take a finite class of prediction rules $\mathcal{F} = \{f_1, \ldots, f_M\}$. Our aim is to construct
a classifier $\tilde f_n$ which mimics the best one of them w.r.t. the excess risk and
with an optimal residual. Namely, we want to state an oracle inequality

$$\mathbb{E}\left[R(\tilde f_n) - R^*\right] \le a_0 \min_{f \in \mathcal{F}} (R(f) - R^*) + C\gamma(M, n), \qquad (1)$$

where $a_0 \ge 1$ and $C > 0$ are some absolute constants and $\gamma(M, n)$ is the residual.
The classical procedure, due to Vapnik and Chervonenkis (see, e.g. [21]), is to
look for an ERM classifier, i.e., the one which minimizes the empirical risk

$$R_n(f) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_{(Y_i f(X_i) \le 0)}, \qquad (2)$$

over all prediction rules $f$ in $\mathcal{F}$, where $\mathbb{1}_E$ denotes the indicator of the set $E$.
This procedure leads to optimal theoretical results (see, e.g. Chapter 12 of [12]),
but minimizing the empirical risk (2) is computationally intractable for sets $\mathcal{F}$
of classifiers with large cardinality (often depending on the sample size $n$),
because this risk is neither convex nor continuous. Nevertheless, we might base a
tractable estimation procedure on minimization of a convex surrogate $\phi$ for the
loss ( J IS] I [Hi [H]i i22 and ,23,). A wide variety of classification methods in 
machine learning are based on this idea, in particular, on using the convex loss 
associated to support vector machines (^J, |21)'). 

$$\phi(x) = \max(0, 1 - x),$$

called the hinge loss. The risk associated to this loss is called the hinge risk and
is defined by

$$A(f) = \mathbb{E}[\max(0, 1 - Y f(X))],$$

for all $f : \mathcal{X} \to \mathbb{R}$. The optimal hinge risk is defined by

$$A^* = \inf_f A(f), \qquad (3)$$



where the infimum is taken over all measurable functions $f$. The Bayes rule $f^*$
attains the infimum in (3) and, moreover, denoting by $R(f)$ the misclassification
error of $\mathrm{sign}(f)$ for all measurable functions $f$ with values in $\mathbb{R}$, Zhang, cf. [29],
has shown that

$$R(f) - R^* \le A(f) - A^*, \qquad (4)$$

for any real valued measurable function $f$. Thus, minimization of the excess hinge
risk $A(f) - A^*$ provides a reasonable alternative for minimization of the excess
risk. In this paper we provide a procedure which does not need any minimization
step. We use a convex combination of the given prediction rules, as explained in
Section 2.
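
As a sanity check of inequality (4), the following minimal sketch evaluates both excess risks exactly on a three-point space. The distribution (`eta`, `p`) and the real-valued rule `f` are made-up illustrative values, not taken from the paper, and ties in sign(f) are sent to +1 by assumption.

```python
def risks(eta, p, f):
    """Return (misclassification risk of sign(f), hinge risk of f) on a finite space.

    eta[i] = P(Y = 1 | X = x_i), p[i] = P(X = x_i), f[i] = f(x_i).
    """
    R = A = 0.0
    for e, w, v in zip(eta, p, f):
        pred = 1.0 if v >= 0 else -1.0  # sign(f), ties sent to +1 (assumption)
        R += w * (e * (pred != 1.0) + (1 - e) * (pred != -1.0))
        A += w * (e * max(0.0, 1 - v) + (1 - e) * max(0.0, 1 + v))
    return R, A

# Hypothetical distribution on three points.
eta = [0.9, 0.6, 0.2]
p = [0.5, 0.3, 0.2]

# Bayes rule f*(x) = sign(2 eta(x) - 1); it attains both R* and A*.
f_star = [1.0 if e >= 0.5 else -1.0 for e in eta]
R_star, A_star = risks(eta, p, f_star)

# An arbitrary real-valued rule.
f = [0.3, -0.5, 2.0]
R_f, A_f = risks(eta, p, f)
excess_R = R_f - R_star
excess_A = A_f - A_star
```

On this toy example the excess hinge risk dominates the excess misclassification risk, as (4) requires, and the Bayes rule satisfies A* = 2R* because it takes values in {-1, 1}.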

The difficulty of classification is closely related to the behavior of the
conditional probability function $\eta$ near $1/2$ (the random variable $|\eta(X) - 1/2|$ is
sometimes called the theoretical margin). Tsybakov has introduced, in [25], an
assumption on the margin, called margin (or low noise) assumption,

(MA) Margin (or low noise) assumption. The probability distribution $\pi$
on the space $\mathcal{X} \times \{-1, 1\}$ satisfies the margin assumption MA($\kappa$) with margin
parameter $1 \le \kappa < +\infty$ if there exists $c_0 > 0$ such that,

$$\mathbb{E}\left[|f(X) - f^*(X)|\right] \le c_0 \left(R(f) - R^*\right)^{1/\kappa}, \qquad (5)$$

for all measurable functions $f$ with values in $\{-1, 1\}$.

Under this assumption, the risk of an ERM classifier over some fixed class $\mathcal{F}$ can
converge to the minimum risk over the class with fast rates, namely faster than
$n^{-1/2}$ (cf. [25]). On the other hand, with no margin assumption on the joint
distribution $\pi$ (but combinatorial or complexity assumption on the class $\mathcal{F}$), the
convergence rate of the excess risk is not faster than $n^{-1/2}$ (cf. [12]).
In this paper we suggest an easily implementable procedure of aggregation of
classifiers and prove the following results:

1. We obtain an oracle inequality for our procedure and we use it to show 
that our classifiers are adaptive both to the margin parameter (low noise 
exponent) and to a complexity parameter. 

2. We generalize the lower bound inequality stated in Chapter 14 of by 
introducing the margin assumption and deduce optimal rates of aggregation 
under low noise assumption in the spirit of Tsybakov [24].

3. We obtain classifiers with minimax fast rates of convergence on a Holder class 
of conditional probability functions 77 and under the margin assumption. 

The paper is organized as follows. In Section 2 we prove an oracle inequality
for our convex aggregate, with an optimal residual, which will be used in
Section 3 to construct minimax classifiers and to obtain adaptive classifiers by
aggregation of them. Proofs are given in Section 4.

2 Oracle Inequality 

We have $M$ prediction rules $f_1, \ldots, f_M$. We want to mimic the best of them
according to the excess risk under the margin assumption. Our procedure
uses exponential weights. Similar constructions in other contexts can be found,



e.g., in|n|,[2H|,II31,|21,in|,|IHl,|lZ|. Consider the following aggregate which 
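is a convex combination with exponential weights of $M$ classifiers,

$$\tilde f_n = \sum_{j=1}^{M} w_j^{(n)} f_j, \qquad (6)$$

where

$$w_j^{(n)} = \frac{\exp\left(\sum_{i=1}^{n} Y_i f_j(X_i)\right)}{\sum_{k=1}^{M} \exp\left(\sum_{i=1}^{n} Y_i f_k(X_i)\right)}, \quad \forall j = 1, \ldots, M. \qquad (7)$$

Since $f_1, \ldots, f_M$ take their values in $\{-1, 1\}$, we have,

$$w_j^{(n)} = \frac{\exp(-n A_n(f_j))}{\sum_{k=1}^{M} \exp(-n A_n(f_k))}, \qquad (8)$$

for all $j \in \{1, \ldots, M\}$, where

$$A_n(f) = \frac{1}{n} \sum_{i=1}^{n} \max(0, 1 - Y_i f(X_i)) \qquad (9)$$

is the empirical analog of the hinge risk. Since $A_n(f_j) = 2R_n(f_j)$ for all $j =
1, \ldots, M$, these weights can be written in terms of the empirical risks of the $f_j$'s,

$$w_j^{(n)} = \frac{\exp(-2n R_n(f_j))}{\sum_{k=1}^{M} \exp(-2n R_n(f_k))}, \quad \forall j = 1, \ldots, M.$$

Remark that, using the definition (8) for the weights, we can aggregate functions
with values in $\mathbb{R}$ (like in Theorem 1) and not only functions with values in $\{-1, 1\}$.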
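
The weights above can be computed directly from the sample. The following is a minimal sketch, assuming a hypothetical toy sample and three threshold rules chosen only for illustration; it evaluates the weights (7), checks that they coincide with the form written with the empirical risks, and builds the convex combination. The max-subtraction inside `aew_weights` is a standard numerical safeguard, not part of the paper's definition.

```python
import math

def aew_weights(scores):
    """Normalize exp(score_j) into weights; subtracting the max avoids overflow."""
    top = max(scores)
    unnorm = [math.exp(s - top) for s in scores]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Hypothetical toy sample and three threshold rules with values in {-1, 1}.
X = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
Y = [-1, -1, 1, 1, 1, -1]
rules = [lambda x, t=t: 1 if x >= t else -1 for t in (-1.5, 0.0, 1.5)]
n = len(X)

# Weights (7): proportional to exp(sum_i Y_i f_j(X_i)).
margins = [sum(y * f(x) for x, y in zip(X, Y)) for f in rules]
w_margin = aew_weights(margins)

# Same weights written with the empirical risks: exp(-2 n R_n(f_j)).
emp_risks = [sum(1 for x, y in zip(X, Y) if f(x) != y) / n for f in rules]
w_risk = aew_weights([-2 * n * r for r in emp_risks])

def aggregate(x):
    """Convex combination of the rules with the exponential weights."""
    return sum(w * f(x) for w, f in zip(w_margin, rules))
```

Both weight vectors agree because, for rules with values in {-1, 1}, the sum of the margins equals n minus twice the number of errors, and the common factor exp(n) cancels in the normalization.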

The aggregation procedure defined by (6) with weights (8), which we call
aggregation with exponential weights (AEW), can be compared to the ERM one.
First, unlike the ERM procedure, our AEW method does not need any minimization
algorithm. Second, the AEW is less sensitive to overfitting.
Intuitively, if the classifier with smallest empirical risk is overfitted
(it means that the classifier fits the observations too closely) then the ERM
procedure will be overfitted. But, if other classifiers in $\mathcal{F}$ are good classifiers,
our procedure will consider their "opinions" in the final decision procedure and
these opinions can balance the opinion of the overfitted classifier in $\mathcal{F}$,
which can be false because of its overfitting. The ERM only considers
the "opinion" of the classifier with the smallest risk, whereas the AEW takes
into account all the opinions of the classifiers in the set $\mathcal{F}$. The AEW is more
temperate than the ERM. Understanding why aggregation procedures are
often more efficient than the ERM procedure from a theoretical point of view is
a deep question, on which we are still working at the time of writing.
Finally, the following proposition shows that the AEW has the same theoretical
property as the ERM procedure up to the residual $(\log M)/n$.



Proposition 1. Let $M \ge 2$ be an integer, $f_1, \ldots, f_M$ be $M$ real valued functions
on $\mathcal{X}$. For any integer $n$, the aggregate $\tilde f_n$ defined in (6) with weights (8) satisfies

$$A_n(\tilde f_n) \le \min_{i=1,\ldots,M} A_n(f_i) + \frac{\log(M)}{n}.$$

The following theorem provides first an exact oracle inequality w.r.t. the hinge
risk satisfied by the AEW procedure and second shows its optimality among all
aggregation procedures. We deduce from it that, for a margin parameter $\kappa \ge 1$
and a set of $M$ functions with values in $[-1, 1]$, $\mathcal{F} = \{f_1, \ldots, f_M\}$, the residual
$\gamma(\mathcal{F}, \pi, n, \kappa)$ appearing in the theorem
is an optimal rate of convex aggregation of $M$ functions with values in $[-1, 1]$
w.r.t. the hinge risk, in the sense of [18].

Theorem 1 (Oracle inequality and Lower bound). Let $\kappa \ge 1$. We assume
that $\pi$ satisfies MA($\kappa$). We denote by $\mathcal{C}$ the convex hull of a finite set of functions
with values in $[-1, 1]$, $\mathcal{F} = \{f_1, \ldots, f_M\}$. The AEW procedure, introduced in
(6) with weights (8) (remark that the form of the weights in (8) allows to take
real valued functions for the $f_j$'s), satisfies for any integer $n \ge 1$ the following
inequality

$$\mathbb{E}\left[A(\tilde f_n) - A^*\right] \le \min_{f \in \mathcal{C}} (A(f) - A^*) + C_0 \gamma(\mathcal{F}, \pi, n, \kappa),$$

where $C_0 > 0$ depends only on the constants $\kappa$ and $c_0$ appearing in MA($\kappa$).

Moreover, there exists a set of prediction rules $\mathcal{F} = \{f_1, \ldots, f_M\}$ such that for
any procedure $\bar f_n$ with values in $\mathbb{R}$, there exists a probability measure $\pi$ satisfying
MA($\kappa$) such that for any integers $M$, $n$ with $\log M \le n$ we have

$$\mathbb{E}\left[A(\bar f_n) - A^*\right] \ge \min_{f \in \mathcal{C}} (A(f) - A^*) + C_0' \gamma(\mathcal{F}, \pi, n, \kappa),$$

where $C_0' > 0$ depends only on the constants $\kappa$ and $c_0$ appearing in MA($\kappa$).

The hinge loss is linear on $[-1, 1]$, thus, model selection aggregation or convex
aggregation are identical problems if we use the hinge risk and if we aggregate
functions with values in $[-1, 1]$. Namely, $\min_{f \in \mathcal{F}} A(f) = \min_{f \in \mathcal{C}} A(f)$. Moreover,
the result of Theorem 1 is obtained for the aggregation of functions with values
in $[-1, 1]$ and not only for prediction rules. In fact, only functions with values in
$[-1, 1]$ have to be considered when we use the hinge loss since, for any real valued
function $f$, we have $\max(0, 1 - y\psi(f(x))) \le \max(0, 1 - y f(x))$ for all $x \in \mathcal{X}$, $y \in
\{-1, 1\}$ where $\psi$ is the projection on $[-1, 1]$, thus, $A(\psi(f)) - A^* \le A(f) - A^*$.
Remark that, under MA($\kappa$), there exists $c > 0$ such that $\mathbb{E}[|f(X) - f^*(X)|] \le
c(A(f) - A^*)^{1/\kappa}$ for all functions $f$ on $\mathcal{X}$ with values in $[-1, 1]$ (cf. [18]). The
proof of Theorem 1 is not given here for lack of space. It can be found in
[18]. Instead, we prove here the following slightly less general result that
will be further used to construct adaptive minimax classifiers.



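Theorem 2. Let $\kappa \ge 1$ and let $\mathcal{F} = \{f_1, \ldots, f_M\}$ be a finite set of prediction
rules with $M \ge 3$. We denote by $\mathcal{C}$ the convex hull of $\mathcal{F}$. We assume that $\pi$
satisfies MA($\kappa$). The aggregate defined in (6) with the exponential weights (7)
(or (8)) satisfies for any integers $n$, $M$ and any $a > 0$ the following inequality

$$\mathbb{E}\left[A(\tilde f_n) - A^*\right] \le (1 + a) \min_{f \in \mathcal{C}} (A(f) - A^*) + C \left(\frac{\log M}{n}\right)^{\frac{\kappa}{2\kappa - 1}},$$

where $C > 0$ is a constant depending only on $a$.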

Corollary 1. Let $\kappa \ge 1$, $M \ge 3$ and $\{f_1, \ldots, f_M\}$ be a finite set of prediction
rules. We assume that $\pi$ satisfies MA($\kappa$). The AEW procedure satisfies for any
number $a > 0$ and any integers $n$, $M$ the following inequality, with $C > 0$ a
constant depending only on $a$,

$$\mathbb{E}\left[R(\tilde f_n) - R^*\right] \le 2(1 + a) \min_{j=1,\ldots,M} (R(f_j) - R^*) + C \left(\frac{\log M}{n}\right)^{\frac{\kappa}{2\kappa - 1}}.$$

We denote by $\mathcal{P}_\kappa$ the set of all probability measures on $\mathcal{X} \times \{-1, 1\}$ satisfying
the margin assumption MA($\kappa$). Combining Corollary 1 and the following
theorem, we get that the residual

$$\left(\frac{\log M}{n}\right)^{\frac{\kappa}{2\kappa - 1}}$$

is a near optimal rate of model selection aggregation in the sense of [18] when
the underlying probability measure $\pi$ belongs to $\mathcal{P}_\kappa$.

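Theorem 3. For any integers $M$ and $n$ satisfying $M \le \exp(n)$, there exist $M$
prediction rules $f_1, \ldots, f_M$ such that for any classifier $\hat f_n$ and any $a > 0$, we
have

$$\sup_{\pi \in \mathcal{P}_\kappa} \left[ \mathbb{E}\left[R(\hat f_n) - R^*\right] - 2(1 + a) \min_{j=1,\ldots,M} (R(f_j) - R^*) \right] \ge C_1 \left(\frac{\log M}{n}\right)^{\frac{\kappa}{2\kappa - 1}},$$

where $C_1 = c_0^{\kappa} / \left(4e\, 2^{2\kappa(\kappa-1)/(2\kappa-1)} (\log 2)^{\kappa/(2\kappa-1)}\right)$.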

3 Adaptivity Both to the Margin and to Regularity. 

In this section we give two applications of the oracle inequality stated in Corol- 
lary ^ First, we construct classifiers with minimax rates of convergence and 
second, we obtain adaptive classifiers by aggregating the minimax ones. Follow- 
ing we focus on the regularity model where rj belongs to the Holder class. 

For any multi-index $s = (s_1, \ldots, s_d) \in \mathbb{N}^d$ and any $x = (x_1, \ldots, x_d) \in \mathbb{R}^d$, we
define $|s| = \sum_{j=1}^{d} s_j$, $s! = s_1! \cdots s_d!$, $x^s = x_1^{s_1} \cdots x_d^{s_d}$ and $\|x\| = (x_1^2 + \ldots + x_d^2)^{1/2}$.
We denote by $D^s$ the differential operator $\frac{\partial^{s_1 + \ldots + s_d}}{\partial x_1^{s_1} \cdots \partial x_d^{s_d}}$.

Let $\beta > 0$. We denote by $\lfloor\beta\rfloor$ the maximal integer that is strictly less than
$\beta$. For any $x \in (0, 1)^d$ and any $\lfloor\beta\rfloor$-times continuously differentiable real valued
function $g$ on $(0, 1)^d$, we denote by $g_x$ its Taylor polynomial of degree $\lfloor\beta\rfloor$ at
point $x$, namely,

$$g_x(y) = \sum_{|s| \le \lfloor\beta\rfloor} \frac{(y - x)^s}{s!} D^s g(x).$$

Let $L > 0$ and $\beta > 0$. The $(\beta, L, [0, 1]^d)$-Hölder class of functions, denoted
by $\Sigma(\beta, L, [0, 1]^d)$, is the set of all real valued functions $g$ on $[0, 1]^d$ that are
$\lfloor\beta\rfloor$-times continuously differentiable on $(0, 1)^d$ and satisfy, for any $x, y \in (0, 1)^d$, the
inequality

$$|g(y) - g_x(y)| \le L \|x - y\|^{\beta}.$$

A control of the complexity of Hölder classes is given by Kolmogorov and
Tikhomirov (1961):

$$\mathcal{N}\left(\Sigma(\beta, L, [0, 1]^d), \varepsilon, L^\infty([0, 1]^d)\right) \le A(\beta, d)\, \varepsilon^{-d/\beta}, \quad \forall \varepsilon > 0, \qquad (10)$$

where the LHS is the $\varepsilon$-entropy of the $(\beta, L, [0, 1]^d)$-Hölder class w.r.t. the
$L^\infty([0, 1]^d)$-norm and $A(\beta, d)$ is a constant depending only on $\beta$ and $d$.

If we want to use entropy assumptions on the set which $\eta$ belongs to, we
need to make a link between $P^X$ and the Lebesgue measure, since the distance
in (10) is the $L^\infty$-norm w.r.t. the Lebesgue measure. Therefore, we introduce the
following assumption:

(A1) The marginal distribution $P^X$ on $\mathcal{X}$ of $\pi$ is absolutely continuous w.r.t. the
Lebesgue measure $\lambda_d$ on $[0, 1]^d$, and there exists a version of its density which is
upper bounded by $\mu_{max} < \infty$.

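We consider the following class of models. For all $\kappa \ge 1$ and $\beta > 0$, we denote
by $\mathcal{P}_{\kappa,\beta}$ the set of all probability measures $\pi$ on $\mathcal{X} \times \{-1, 1\}$, such that

1. MA($\kappa$) is satisfied.

2. The marginal $P^X$ satisfies (A1).

3. The conditional probability function $\eta$ belongs to $\Sigma(\beta, L, \mathbb{R}^d)$.

Now, we define the class of classifiers which attain the optimal rate of
convergence, in a minimax sense, over the models $\mathcal{P}_{\kappa,\beta}$. Let $\kappa > 1$ and $\beta > 0$. For
any $\varepsilon > 0$, we denote by $\Sigma_\varepsilon(\beta)$ an $\varepsilon$-net on $\Sigma(\beta, L, [0, 1]^d)$ for the $L^\infty$-norm,
such that its cardinal satisfies $\log \mathrm{Card}(\Sigma_\varepsilon(\beta)) \le A(\beta, d)\varepsilon^{-d/\beta}$. We consider the
AEW procedure defined in (6), over the net $\Sigma_\varepsilon(\beta)$:

$$\tilde f_n^{\varepsilon} = \sum_{\bar\eta \in \Sigma_\varepsilon(\beta)} w^{(n)}(f_{\bar\eta}) f_{\bar\eta}, \quad \text{where } f_{\bar\eta}(x) = 2\,\mathbb{1}_{(\bar\eta(x) \ge 1/2)} - 1. \qquad (11)$$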

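Theorem 4. Let $\kappa > 1$ and $\beta > 0$. Let $a_1 > 0$ be an absolute constant and
consider $\varepsilon_n = a_1 n^{-\frac{\beta(\kappa-1)}{\beta(2\kappa-1)+d(\kappa-1)}}$. The aggregate (11) with $\varepsilon = \varepsilon_n$ satisfies, for
any $\pi \in \mathcal{P}_{\kappa,\beta}$ and any integer $n \ge 1$, the following inequality

$$\mathbb{E}\left[R(\tilde f_n^{\varepsilon_n}) - R^*\right] \le C_2(\kappa, \beta, d)\, n^{-\frac{\beta\kappa}{\beta(2\kappa-1)+d(\kappa-1)}},$$

where $C_2(\kappa, \beta, d) = 2 \max\left(4(2c_0\mu_{max})^{\kappa/(\kappa-1)}, C A(\beta, d)^{\frac{\kappa}{2\kappa-1}}\right) (a_1)^{\frac{\kappa}{\kappa-1}} \vee (a_1)^{-\frac{d\kappa}{\beta(2\kappa-1)}}$
and $C$ is the constant appearing in Corollary 1.

Audibert and Tsybakov (cf. [1]) have shown the optimality, in a minimax sense,
of the rate obtained in Theorem 4. Note that this rate is a fast rate because it
can approach $1/n$ when $\kappa$ is close to 1 and $\beta$ is large.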
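
To make the behavior of this rate concrete, here is a small sketch that evaluates the exponent $\beta\kappa/(\beta(2\kappa-1)+d(\kappa-1))$ of the minimax rate. It also evaluates the exponent of the net step, assuming the step is $n^{-\beta(\kappa-1)/(\beta(2\kappa-1)+d(\kappa-1))}$, which is the value obtained by balancing the approximation and residual terms in the proof of Theorem 4. The function names are illustrative only.

```python
def minimax_exponent(kappa, beta, d):
    """Exponent r such that the excess risk bound is of order n**(-r)."""
    return beta * kappa / (beta * (2 * kappa - 1) + d * (kappa - 1))

def net_step_exponent(kappa, beta, d):
    """Exponent s such that the net step is of order n**(-s) (assumed balancing choice)."""
    return beta * (kappa - 1) / (beta * (2 * kappa - 1) + d * (kappa - 1))

# Nearly the fast rate 1/n: margin parameter close to 1 and a very smooth eta.
fast = minimax_exponent(1.01, 100.0, 2)

# A harder regime: large margin parameter, low smoothness, higher dimension.
slow = minimax_exponent(3.0, 1.0, 5)
```

The approximation term scales like the step raised to the power $\kappa/(\kappa-1)$, so multiplying `net_step_exponent` by $\kappa/(\kappa-1)$ recovers `minimax_exponent`.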

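The construction of the classifier $\tilde f_n^{\varepsilon_n}$ needs the knowledge of $\kappa$ and $\beta$ which
are not available in practice. Thus, we need to construct classifiers independent
of these parameters and which learn with the optimal rate $n^{-\beta\kappa/(\beta(2\kappa-1)+d(\kappa-1))}$
if the underlying probability measure $\pi$ belongs to $\mathcal{P}_{\kappa,\beta}$, for different values of $\kappa$
and $\beta$. We now show that using the procedure (6) to aggregate the classifiers
$\tilde f_n^{\varepsilon}$ for different values of $\varepsilon$ in a grid, the oracle inequality of Corollary 1 provides
the result.

We use a split of the sample for the adaptation step. Denote by $D_m^{(1)}$ the
subsample containing the first $m$ observations and $D_l^{(2)}$ the one containing the
$l$ $(= n - m)$ last ones. Subsample $D_m^{(1)}$ is used to construct the classifiers $\tilde f_m^{\varepsilon}$ for
different values of $\varepsilon$ in a finite grid. Subsample $D_l^{(2)}$ is used to aggregate these
classifiers by the procedure (6). We take

$$l = \left\lceil \frac{n}{\log n} \right\rceil \quad \text{and} \quad m = n - l.$$

Set $\Delta = \log n$. We consider a grid of values for $\varepsilon$:

$$\mathcal{G}(n) = \left\{ \phi_{n,k} = \frac{k}{\Delta} : k \in \{1, \ldots, \lfloor \Delta/2 \rfloor\} \right\}.$$

For any $\phi \in \mathcal{G}(n)$ we consider the step $\varepsilon_m^{(\phi)} = m^{-\phi}$. The classifier that we propose
is the sign of

$$\tilde f_n^{adp} = \sum_{\phi \in \mathcal{G}(n)} w^{[l]}\left(\tilde F_m^{\varepsilon_m^{(\phi)}}\right) \tilde F_m^{\varepsilon_m^{(\phi)}},$$

where $\tilde F_m^{\varepsilon}(x) = \mathrm{sign}(\tilde f_m^{\varepsilon}(x))$ is the classifier associated to the aggregate $\tilde f_m^{\varepsilon}$ for
all $\varepsilon > 0$ and the weights $w^{[l]}(F)$ are the ones introduced in (7) constructed with
the observations $D_l^{(2)}$ for all $F \in \mathcal{F}(n) = \{\mathrm{sign}(\tilde f_m^{\varepsilon}) : \varepsilon = m^{-\phi}, \phi \in \mathcal{G}(n)\}$:

$$w^{[l]}(F) = \frac{\exp\left(\sum_{i=m+1}^{n} Y_i F(X_i)\right)}{\sum_{G \in \mathcal{F}(n)} \exp\left(\sum_{i=m+1}^{n} Y_i G(X_i)\right)}.$$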
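
The sample split and the grid described above are simple to compute. The sketch below assumes the split size is the ceiling of n / log n and that the grid is k / log n for k up to the integer part of (log n) / 2, with net steps m raised to minus the grid value; the function name `split_and_grid` is illustrative only.

```python
import math

def split_and_grid(n):
    """Return (l, m, grid, steps) for a sample of size n.

    l: size of the aggregation subsample, m: size of the construction subsample,
    grid: candidate exponents k / log(n), steps: corresponding net steps m**(-phi).
    """
    delta = math.log(n)
    l = math.ceil(n / delta)
    m = n - l
    k_max = math.floor(delta / 2)
    grid = [k / delta for k in range(1, k_max + 1)]
    steps = [m ** (-phi) for phi in grid]
    return l, m, grid, steps

l, m, grid, steps = split_and_grid(1000)
```

For n = 1000 this keeps 855 observations for constructing the candidate classifiers and 145 for aggregating them, with only three grid points, which illustrates that the number of aggregated classifiers grows like log n.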



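The following theorem shows that $\tilde f_n^{adp}$ is adaptive both to the low noise
exponent $\kappa$ and to the complexity (or regularity) parameter $\beta$, provided that $(\kappa, \beta)$
belongs to a compact subset of $(1, +\infty) \times (0, +\infty)$.

Theorem 5. Let $K$ be a compact subset of $(1, +\infty) \times (0, +\infty)$. There exists a
constant $C_3 > 0$ that depends only on $K$ and $d$ such that for any integer $n \ge 1$,
any $(\kappa, \beta) \in K$ and any $\pi \in \mathcal{P}_{\kappa,\beta}$, we have,

$$\mathbb{E}\left[R(\tilde f_n^{adp}) - R^*\right] \le C_3\, n^{-\frac{\kappa\beta}{\beta(2\kappa-1)+d(\kappa-1)}}.$$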

Classifiers $\tilde f_n^{\varepsilon_n}$ are not easily implementable since the cardinality of $\Sigma_{\varepsilon_n}(\beta)$ is
an exponential of $n$. An alternative procedure which is easily implementable is
to aggregate plug-in classifiers constructed in Audibert and Tsybakov (cf. [1]).

We introduce the class of models $\mathcal{P}'_{\kappa,\beta}$ composed of all the underlying
probability measures $\pi$ such that:

1. $\pi$ satisfies the margin assumption MA($\kappa$).

2. The conditional probability function $\eta \in \Sigma(\beta, L, [0, 1]^d)$.

3. The marginal distribution of $X$ is supported on $[0, 1]^d$ and has a Lebesgue
density lower bounded and upper bounded by two constants.

Theorem 6 (Audibert and Tsybakov (2005)). Let $\kappa > 1$, $\beta > 0$. The excess
risk of the plug-in classifier $\hat f_n^{(\beta)} = 2\,\mathbb{1}_{\{\hat\eta_n^{(\beta)} \ge 1/2\}} - 1$ satisfies

$$\sup_{\pi \in \mathcal{P}'_{\kappa,\beta}} \mathbb{E}\left[R(\hat f_n^{(\beta)}) - R^*\right] \le C_4\, n^{-\frac{\beta\kappa}{(\kappa-1)(2\beta+d)}},$$

where $\hat\eta_n^{(\beta)}(\cdot)$ is the locally polynomial estimator of $\eta(\cdot)$ of order $\lfloor\beta\rfloor$ with
bandwidth $h = n^{-\frac{1}{2\beta+d}}$ and $C_4$ a positive constant.

In [1], it is shown that the rate $n^{-\frac{\beta\kappa}{(\kappa-1)(2\beta+d)}}$ is minimax over $\mathcal{P}'_{\kappa,\beta}$, if $\beta \le d(\kappa - 1)$.
Remark that the fast rate $n^{-1}$ can be achieved.

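We aggregate the classifiers $\hat f_n^{(\beta)}$ for different values of $\beta$ lying in a finite
grid. We use a split of the sample to construct our adaptive classifier: $l =
\lceil n / \log n \rceil$ and $m = n - l$. The training sample $D_m^1 = ((X_1, Y_1), \ldots, (X_m, Y_m))$
is used for the construction of the class of plug-in classifiers

$$\mathcal{F} = \left\{ \hat f_m^{(\beta_k)} : \beta_k = \frac{kd}{\Delta - 2k},\ k \in \{1, \ldots, \lfloor \Delta/2 \rfloor\} \right\}, \quad \text{where } \Delta = \log n.$$

The validation sample $D_l^2 = ((X_{m+1}, Y_{m+1}), \ldots, (X_n, Y_n))$ is used for the
construction of weights

$$w^{[l]}(f) = \frac{\exp\left(\sum_{i=m+1}^{n} Y_i f(X_i)\right)}{\sum_{\bar f \in \mathcal{F}} \exp\left(\sum_{i=m+1}^{n} Y_i \bar f(X_i)\right)}, \quad \forall f \in \mathcal{F}.$$

The classifier that we propose is $\tilde F_n^{adp} = \mathrm{sign}(\tilde f_n^{adp})$, where $\tilde f_n^{adp} = \sum_{f \in \mathcal{F}} w^{[l]}(f) f$.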
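
The aggregation step on the validation sample can be sketched as follows. This is a minimal illustration under stated assumptions: the candidates are three fixed threshold rules standing in for the plug-in classifiers (they are not fitted on the training part here), the data are a deterministic synthetic sample with one label in ten flipped, and `aggregate_on_validation` is a hypothetical helper name.

```python
import math

def aggregate_on_validation(candidates, X_val, Y_val):
    """Exponential weights from the validation sample and the sign of the aggregate."""
    scores = [sum(y * f(x) for x, y in zip(X_val, Y_val)) for f in candidates]
    top = max(scores)  # subtract the max for numerical stability
    unnorm = [math.exp(s - top) for s in scores]
    total = sum(unnorm)
    weights = [u / total for u in unnorm]

    def classifier(x):
        value = sum(w * f(x) for w, f in zip(weights, candidates))
        return 1 if value >= 0 else -1

    return weights, classifier

# Deterministic synthetic sample: label +1 when x > 0.5, one label in ten flipped.
n = 400
X = [(i * 0.6180339887) % 1.0 for i in range(n)]
Y = []
for i, x in enumerate(X):
    label = 1 if x > 0.5 else -1
    Y.append(-label if i % 10 == 0 else label)

# Sample split: the last l observations are kept for the weights.
l = math.ceil(n / math.log(n))
m = n - l

# Stand-ins for the plug-in classifiers (hypothetical threshold rules).
candidates = [lambda x, t=t: 1 if x > t else -1 for t in (0.2, 0.5, 0.8)]
weights, F_adp = aggregate_on_validation(candidates, X[m:], Y[m:])
```

In this toy run the rule with threshold 0.5 makes the fewest validation errors, so it receives almost all of the weight; the other rules still contribute to the convex combination, which is the difference with ERM on the validation sample.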



Theorem 7. Let $K$ be a compact subset of $(1, +\infty) \times (0, +\infty)$. There exists a
constant $C_5 > 0$ depending only on $K$ and $d$ such that for any integer $n \ge 1$,
any $(\kappa, \beta) \in K$ such that $\beta < d(\kappa - 1)$, and any $\pi \in \mathcal{P}'_{\kappa,\beta}$, we have,

$$\mathbb{E}\left[R(\tilde F_n^{adp}) - R^*\right] \le C_5\, n^{-\frac{\beta\kappa}{(\kappa-1)(2\beta+d)}}.$$

Adaptive classifiers are obtained in Theorems 5 and 7 by aggregation of only
$\log n$ classifiers. Other constructions of adaptive classifiers can be found in [17],
in particular, adaptive SVM classifiers.



4 Proofs 



Proof of Proposition 1. Using the convexity of the hinge loss, we have $A_n(\tilde f_n) \le
\sum_{j=1}^{M} w_j A_n(f_j)$. Denote by $\hat i = \arg\min_{i=1,\ldots,M} A_n(f_i)$; we have $A_n(f_i) =
A_n(f_{\hat i}) + \frac{1}{n}\left(\log(w_{\hat i}) - \log(w_i)\right)$ for all $i = 1, \ldots, M$ and by averaging over the $w_i$ we get:

$$A_n(\tilde f_n) \le \min_{i=1,\ldots,M} A_n(f_i) + \frac{\log(M)}{n}, \qquad (12)$$

where we used that $\sum_{j=1}^{M} w_j \log\left(\frac{w_j}{1/M}\right) = K(w|u) \ge 0$ where $K(w|u)$ denotes the
Kullback-Leibler divergence between the weights $w = (w_j)_{j=1,\ldots,M}$ and uniform
weights $u = (1/M)_{j=1,\ldots,M}$.
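
Inequality (12) holds for any sample and any real valued functions, so it can be checked numerically. The sketch below uses made-up labels and prediction values (purely illustrative) and weights proportional to exp(-n A_n(f_j)); subtracting the smallest empirical hinge risk inside the exponential is only a numerical safeguard.

```python
import math

def hinge_risk(values, labels):
    """Empirical hinge risk: average of max(0, 1 - y * f(x))."""
    return sum(max(0.0, 1.0 - y * v) for v, y in zip(values, labels)) / len(labels)

# Hypothetical labels and the values of M = 3 real valued functions on the sample.
labels = [1, -1, 1, 1, -1, -1, 1, -1]
predictions = [
    [0.8, -0.2, 1.5, -0.4, -1.0, 0.3, 0.9, -2.0],
    [1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0],
    [-0.5, -0.5, 0.5, 2.0, 0.1, -0.7, 1.2, 0.4],
]
n, M = len(labels), len(predictions)

# Exponential weights proportional to exp(-n * A_n(f_j)).
risks = [hinge_risk(p, labels) for p in predictions]
best = min(risks)
unnorm = [math.exp(-n * (r - best)) for r in risks]
weights = [u / sum(unnorm) for u in unnorm]

# Empirical hinge risk of the aggregate versus the bound of Proposition 1.
aggregate = [sum(w * p[i] for w, p in zip(weights, predictions)) for i in range(n)]
lhs = hinge_risk(aggregate, labels)
rhs = best + math.log(M) / n
```

The two steps of the proof are visible here: `lhs` is at most the weighted average of `risks` by convexity, and that average is at most `rhs` by the Kullback-Leibler argument.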

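Proof of Theorem 2. Let $a > 0$. Using Proposition 1, we have for any
$f \in \mathcal{F}$ and for the Bayes rule $f^*$:

$$A(\tilde f_n) - A^* = (1 + a)(A_n(\tilde f_n) - A_n(f^*)) + A(\tilde f_n) - A^* - (1 + a)(A_n(\tilde f_n) - A_n(f^*))$$

$$\le (1 + a)(A_n(f) - A_n(f^*)) + (1 + a)\frac{\log M}{n} + A(\tilde f_n) - A^* - (1 + a)(A_n(\tilde f_n) - A_n(f^*)).$$

Taking the expectations, we get

$$\mathbb{E}\left[A(\tilde f_n) - A^*\right] \le (1 + a)\min_{f \in \mathcal{F}}(A(f) - A^*) + (1 + a)\frac{\log M}{n} + \mathbb{E}\left[A(\tilde f_n) - A^* - (1 + a)(A_n(\tilde f_n) - A_n(f^*))\right].$$

The following inequality follows from the linearity of the hinge loss on $[-1, 1]$:

$$A(\tilde f_n) - A^* - (1 + a)(A_n(\tilde f_n) - A_n(f^*)) \le \max_{f \in \mathcal{F}}\left[A(f) - A^* - (1 + a)(A_n(f) - A_n(f^*))\right].$$

Thus, using Bernstein's inequality, we have for all $0 < \delta < 4 + 2a$:

$$\mathbb{P}\left[A(\tilde f_n) - A^* - (1 + a)(A_n(\tilde f_n) - A_n(f^*)) \ge \delta\right]$$

$$\le \sum_{f \in \mathcal{F}} \mathbb{P}\left[A(f) - A^* - (A_n(f) - A_n(f^*)) \ge \frac{\delta + a(A(f) - A^*)}{1 + a}\right]$$

$$\le \sum_{f \in \mathcal{F}} \exp\left(-\frac{n(\delta + a(A(f) - A^*))^2}{2(1 + a)^2(A(f) - A^*)^{1/\kappa} + 2/3(1 + a)(\delta + a(A(f) - A^*))}\right).$$

There exists a constant $c_1 > 0$ depending only on $a$ such that for all $0 < \delta < 4 + 2a$
and all $f \in \mathcal{F}$, we have

$$\frac{(\delta + a(A(f) - A^*))^2}{2(1 + a)^2(A(f) - A^*)^{1/\kappa} + 2/3(1 + a)(\delta + a(A(f) - A^*))} \ge c_1 \delta^{2 - 1/\kappa}.$$

Thus,

$$\mathbb{P}\left[A(\tilde f_n) - A^* - (1 + a)(A_n(\tilde f_n) - A_n(f^*)) \ge \delta\right] \le M \exp(-n c_1 \delta^{2 - 1/\kappa}).$$

Observe that an integration by parts leads to $\int_a^{\infty} \exp(-b t^{\alpha})\, dt \le \frac{\exp(-b a^{\alpha})}{\alpha b a^{\alpha - 1}}$,
for any $\alpha \ge 1$ and $a, b > 0$, so for all $u > 0$, we get

$$\mathbb{E}\left[A(\tilde f_n) - A^* - (1 + a)(A_n(\tilde f_n) - A_n(f^*))\right] \le 2u + M \frac{\exp(-n c_1 u^{2 - 1/\kappa})}{n c_1 u^{1 - 1/\kappa}}.$$

If we denote by $\mu(M)$ the unique solution of $X = M \exp(-X)$, we have $\log M / 2 \le
\mu(M) \le \log M$. For $u$ such that $n c_1 u^{2 - 1/\kappa} = \mu(M)$, we obtain the result.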
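
The bracketing of the solution of X = M exp(-X) used in the last step can be checked numerically. The following is a small sketch that solves the fixed point equation by bisection, assuming M is at least 3 so that the function x - M exp(-x) changes sign on the interval from 0 to log M; the function name `mu` mirrors the notation of the proof.

```python
import math

def mu(M, tol=1e-12):
    """Solve x = M * exp(-x) by bisection on [0, log M] (valid for M >= 3)."""
    lo, hi = 0.0, math.log(M)
    # g(x) = x - M * exp(-x) is increasing, g(0) = -M < 0 and g(log M) = log M - 1 > 0.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mid - M * math.exp(-mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

values = {M: mu(M) for M in (3, 10, 1000)}
```

For each M in this range the computed root lies between (log M)/2 and log M, which is the property used to turn the bound in u into a bound of order $(\log M / n)^{\kappa/(2\kappa-1)}$.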

Proof of Corollary 1. We deduce Corollary 1 from Theorem 2, using that
for any prediction rule $f$ we have $A(f) - A^* = 2(R(f) - R^*)$ and applying
Zhang's inequality $A(g) - A^* \ge R(g) - R^*$ fulfilled by all $g$ from $\mathcal{X}$ to $\mathbb{R}$.

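Proof of Theorem 3. For all prediction rules $f_1, \ldots, f_M$, we have

$$\sup_{f_1,\ldots,f_M} \inf_{\hat f_n} \sup_{\pi \in \mathcal{P}_\kappa} \left( \mathbb{E}\left[R(\hat f_n) - R^*\right] - 2(1 + a) \min_{j=1,\ldots,M} (R(f_j) - R^*) \right) \ge \inf_{\hat f_n} \sup_{\pi \in \mathcal{P}_\kappa : f^* \in \{f_1,\ldots,f_M\}} \mathbb{E}\left[R(\hat f_n) - R^*\right].$$

Thus, we look for a set of cardinality not greater than $M$, of the worst
probability measures $\pi \in \mathcal{P}_\kappa$ from our classification problem point of view and choose
$f_1, \ldots, f_M$ as the corresponding Bayes rules.

Let $N$ be an integer such that $2^{N-1} \le M$. Let $x_1, \ldots, x_N$ be $N$ distinct
points of $\mathcal{X}$. Let $0 < w < 1/N$. Denote by $P^X$ the probability measure on $\mathcal{X}$
such that $P^X(\{x_j\}) = w$ for $j = 1, \ldots, N - 1$ and $P^X(\{x_N\}) = 1 - (N - 1)w$.
We consider the set of binary sequences $\Omega = \{-1, 1\}^{N-1}$. Let $0 < h < 1$. For all
$\sigma \in \Omega$ we consider

$$\eta_\sigma(x) = \begin{cases} (1 + \sigma_j h)/2 & \text{if } x = x_j,\ j = 1, \ldots, N - 1, \\ 1 & \text{if } x = x_N. \end{cases}$$

For all $\sigma \in \Omega$ we denote by $\pi_\sigma$ the probability measure on $\mathcal{X} \times \{-1, 1\}$ with the
marginal $P^X$ on $\mathcal{X}$ and with the conditional probability function $\eta_\sigma$ of $Y = 1$
knowing $X$.

Assume that $\kappa > 1$. We have $\mathbb{P}(|2\eta_\sigma(X) - 1| \le t) = (N - 1)w\,\mathbb{1}_{\{h \le t\}}$, $\forall\, 0 \le
t < 1$. Thus, if we assume that $(N - 1)w \le h^{1/(\kappa - 1)}$ then $\mathbb{P}(|2\eta_\sigma(X) - 1| \le t) \le
t^{1/(\kappa - 1)}$, for all $t \ge 0$, and according to [23], $\pi_\sigma$ belongs to MA($\kappa$).

We denote by $\rho$ the Hamming distance on $\Omega$ (cf. [20], p. 88). Let $\sigma, \sigma'$ be such
that $\rho(\sigma, \sigma') = 1$. We have

$$H^2\left(\pi_\sigma^{\otimes n}, \pi_{\sigma'}^{\otimes n}\right) = 2\left(1 - \left(1 - w(1 - \sqrt{1 - h^2})\right)^n\right).$$

We take $w$ and $h$ such that $w(1 - \sqrt{1 - h^2}) \le 1/n$, thus, $H^2(\pi_\sigma^{\otimes n}, \pi_{\sigma'}^{\otimes n}) \le \beta =
2(1 - e^{-1}) < 2$ for any integer $n$.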

Let /„ be a classifier and a G SI. Using MA(k), we have 



R{.fn)~R* >{C0W)''E^^ 



By Jensen's Lemma and Assouad's Lemma (cf. we obtain: 



inf sup 



RUn) -R* )> M 



N - 1 



(1-/3/2)= 



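We obtain the result by taking $w = (nh^2)^{-1}$, $N = \lceil \log M / \log 2 \rceil$ and $h =
\left(n^{-1} \lceil \log M / \log 2 \rceil\right)^{(\kappa - 1)/(2\kappa - 1)}$.

For $\kappa = 1$, we take $h = 1/2$, thus $|2\eta_\sigma(X) - 1| \ge 1/2$ a.s. so $\pi_\sigma \in$ MA(1)
(cf. [25]). Putting $w = 4/n$ and $N = \lceil \log M / \log 2 \rceil$ we obtain the result.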

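Proof of Theorem 4. According to Corollary 1, where we set $a = 1$, we
have, for any $\varepsilon > 0$:

$$\mathbb{E}\left[R(\tilde f_n^{\varepsilon}) - R^*\right] \le 4 \min_{\bar\eta \in \Sigma_\varepsilon(\beta)} (R(f_{\bar\eta}) - R^*) + C \left(\frac{\log \mathrm{Card}\, \Sigma_\varepsilon(\beta)}{n}\right)^{\frac{\kappa}{2\kappa - 1}}.$$

Let $\bar\eta$ be a function with values in $[0, 1]$ and denote by $\bar f = 2\,\mathbb{1}_{\bar\eta \ge 1/2} - 1$ the plug-in
classifier associated. We have $|2\eta - 1|\mathbb{1}_{\bar f \ne f^*} \le 2|\bar\eta - \eta|$, thus:

$$R(\bar f) - R^* = \mathbb{E}\left[|2\eta(X) - 1|\mathbb{1}_{\bar f \ne f^*}\right] = \mathbb{E}\left[|2\eta(X) - 1|\mathbb{1}_{\bar f \ne f^*}\mathbb{1}_{\bar f \ne f^*}\right]$$

$$\le \left\| |2\eta - 1|\mathbb{1}_{\bar f \ne f^*} \right\|_{L^\infty(P^X)} \mathbb{E}\left[\mathbb{1}_{\bar f \ne f^*}\right] \le \left\| |2\eta - 1|\mathbb{1}_{\bar f \ne f^*} \right\|_{L^\infty(P^X)} c_0 \left(R(\bar f) - R^*\right)^{1/\kappa},$$

and assumption (A1) lead to

$$R(f_{\bar\eta}) - R^* \le (2c_0\mu_{max})^{\frac{\kappa}{\kappa - 1}} \|\bar\eta - \eta\|_{L^\infty([0,1]^d)}^{\frac{\kappa}{\kappa - 1}}.$$

Hence, for any $\varepsilon > 0$, we have

$$\mathbb{E}\left[R(\tilde f_n^{\varepsilon}) - R^*\right] \le D \left( \varepsilon^{\frac{\kappa}{\kappa - 1}} + \left(\frac{\varepsilon^{-d/\beta}}{n}\right)^{\frac{\kappa}{2\kappa - 1}} \right),$$

where $D = \max\left(4(2c_0\mu_{max})^{\kappa/(\kappa - 1)}, C A(\beta, d)^{\frac{\kappa}{2\kappa - 1}}\right)$. For the value

$$\varepsilon_n = a_1 n^{-\frac{\beta(\kappa - 1)}{\beta(2\kappa - 1) + d(\kappa - 1)}},$$

we have

$$\mathbb{E}\left[R(\tilde f_n^{\varepsilon_n}) - R^*\right] \le C_1 n^{-\frac{\beta\kappa}{\beta(2\kappa - 1) + d(\kappa - 1)}},$$

where $C_1 = 2D (a_1)^{\frac{\kappa}{\kappa - 1}} \vee (a_1)^{-\frac{d\kappa}{\beta(2\kappa - 1)}}$.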

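Proof of Theorem 5. We consider the following function on $(1, +\infty) \times
(0, +\infty)$ with values in $(0, 1/2)$:

$$\phi(\kappa, \beta) = \frac{\beta(\kappa - 1)}{\beta(2\kappa - 1) + d(\kappa - 1)}.$$

For any $n$ greater than $n_1 = n_1(K)$, we have $\Delta^{-1} \le \phi(\kappa, \beta) \le \lfloor \Delta/2 \rfloor \Delta^{-1}$ for
all $(\kappa, \beta) \in K$.

Let $(\kappa_0, \beta_0) \in K$. For any $n \ge n_1$, there exists $k_0 \in \{1, \ldots, \lfloor \Delta/2 \rfloor - 1\}$ such
that

$$\phi_{k_0} = k_0 \Delta^{-1} \le \phi(\kappa_0, \beta_0) < (k_0 + 1)\Delta^{-1}.$$

We denote by $f_{\kappa_0}(\cdot)$ the increasing function $\phi(\kappa_0, \cdot)$ from $(0, +\infty)$ to $(0, 1/2)$.
We set

$$\beta_{0,n} = (f_{\kappa_0})^{-1}(\phi_{k_0}).$$

There exists $m = m(K)$ such that $m|\beta_0 - \beta_{0,n}| \le |f_{\kappa_0}(\beta_0) - f_{\kappa_0}(\beta_{0,n})| \le \Delta^{-1}$.

Let $\pi \in \mathcal{P}_{\kappa_0,\beta_0}$. According to the oracle inequality of Corollary 1, we have,
conditionally to the first subsample $D_m^1$:

$$\mathbb{E}\left[R(\tilde f_n^{adp}) - R^* \,\middle|\, D_m^1\right] \le 4 \min_{\phi \in \mathcal{G}(n)} \left(R(\tilde f_m^{(\phi)}) - R^*\right) + C \left(\frac{\log \mathrm{Card}(\mathcal{G}(n))}{l}\right)^{\frac{\kappa_0}{2\kappa_0 - 1}}.$$

Using the definition of $l$ and the fact that $\mathrm{Card}(\mathcal{G}(n)) \le \log n$ we get that there
exists $\tilde C > 0$ independent of $n$ such that

$$\mathbb{E}\left[R(\tilde f_n^{adp}) - R^*\right] \le \tilde C \left( \mathbb{E}\left[R(\tilde f_m^{(\phi_{k_0})}) - R^*\right] + \left(\frac{\log^2 n}{n}\right)^{\frac{\kappa_0}{2\kappa_0 - 1}} \right).$$

Moreover $\beta_{0,n} \le \beta_0$, hence, $\mathcal{P}_{\kappa_0,\beta_0} \subseteq \mathcal{P}_{\kappa_0,\beta_{0,n}}$. Thus, according to Theorem
4, we have

$$\mathbb{E}\left[R(\tilde f_m^{(\phi_{k_0})}) - R^*\right] \le C_1(K, d)\, m^{-\psi(\kappa_0, \beta_{0,n})},$$

where $C_1(K, d) = \max\left(C_1(\kappa, \beta, d) : (\kappa, \beta) \in K\right)$ and $\psi(\kappa, \beta) = \frac{\beta\kappa}{\beta(2\kappa - 1) + d(\kappa - 1)}$.

By construction, there exists $A_2 = A_2(K, d) > 0$ such that $|\psi(\kappa_0, \beta_{0,n}) -
\psi(\kappa_0, \beta_0)| \le A_2 \Delta^{-1}$. Moreover for any integer $n$ we have $n^{A_2/\log n} = \exp(A_2)$,
which is a constant. We conclude that

$$\mathbb{E}\left[R(\tilde f_n^{adp}) - R^*\right] \le C_2(K, d) \left( n^{-\psi(\kappa_0, \beta_0)} + \left(\frac{\log^2 n}{n}\right)^{\frac{\kappa_0}{2\kappa_0 - 1}} \right),$$

where $C_2(K, d) > 0$ is independent of $n$. We achieve the proof by observing that
$\psi(\kappa_0, \beta_0) < \frac{\kappa_0}{2\kappa_0 - 1}$.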

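Proof of Theorem 7. We consider the following function on $(1, +\infty) \times
(0, +\infty)$ with values in $(0, 1/2)$:

$$\Theta(\kappa, \beta) = \frac{\beta\kappa}{(\kappa - 1)(2\beta + d)}.$$

For any $n$ greater than $n_1 = n_1(K)$, we have $\Delta^{-1} \le \Theta(\kappa, \beta) \le \lfloor \Delta/2 \rfloor \Delta^{-1}$, for
all $(\kappa, \beta) \in K$.

Let $(\kappa_0, \beta_0) \in K$ be such that $\beta_0 < (\kappa_0 - 1)d$. For any $n \ge n_1$, there exists
$k_0 \in \{1, \ldots, \lfloor \Delta/2 \rfloor - 1\}$ such that $k_0 \Delta^{-1} \le \Theta(\kappa_0, \beta_0) < (k_0 + 1)\Delta^{-1}$.

Let $\pi \in \mathcal{P}'_{\kappa_0,\beta_0}$. According to the oracle inequality of Corollary 1, we have,
conditionally to the first subsample $D_m^1$:

$$\mathbb{E}\left[R(\tilde F_n^{adp}) - R^* \,\middle|\, D_m^1\right] \le 4 \min_{f \in \mathcal{F}} (R(f) - R^*) + C \left(\frac{\log \mathrm{Card}(\mathcal{F})}{l}\right)^{\frac{\kappa_0}{2\kappa_0 - 1}}.$$

Using the proof of Theorem 5 we get that there exists $\tilde C > 0$ independent of $n$
such that

$$\mathbb{E}\left[R(\tilde F_n^{adp}) - R^*\right] \le \tilde C \left( \mathbb{E}\left[R(\hat f_m^{(\beta_{k_0})}) - R^*\right] + \left(\frac{\log^2 n}{n}\right)^{\frac{\kappa_0}{2\kappa_0 - 1}} \right).$$

Moreover $\beta_{k_0} \le \beta_0$, hence, $\mathcal{P}'_{\kappa_0,\beta_0} \subseteq \mathcal{P}'_{\kappa_0,\beta_{k_0}}$. Thus, according to Theorem 6,
we have

$$\mathbb{E}\left[R(\hat f_m^{(\beta_{k_0})}) - R^*\right] \le C_4(K, d)\, m^{-\Theta(\kappa_0, \beta_{k_0})},$$

where $C_4(K, d) = \max\left(C_4(\kappa, \beta, d) : (\kappa, \beta) \in K\right)$. We have $|\Theta(\kappa_0, \beta_{k_0}) - \Theta(\kappa_0, \beta_0)| \le
\Delta^{-1}$ by construction. Moreover $n^{1/\log n} = e$ for any integer $n$. We conclude that

$$\mathbb{E}\left[R(\tilde F_n^{adp}) - R^*\right] \le \tilde C_4(K, d) \left( n^{-\Theta(\kappa_0, \beta_0)} + \left(\frac{\log^2 n}{n}\right)^{\frac{\kappa_0}{2\kappa_0 - 1}} \right),$$

where $\tilde C_4(K, d) > 0$ is independent of $n$. We achieve the proof by observing that
$\Theta(\kappa_0, \beta_0) < \frac{\kappa_0}{2\kappa_0 - 1}$, if $\beta_0 < (\kappa_0 - 1)d$.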